Notes: See blog post for this rmd explanation: https://solomonmessing.wordpress.com/2014/01/19/visualization-series-the-scatterplot-or-how-to-use-data-so-you-dont-get-ripped-off/ ***
library('ggplot2')
data(diamonds)
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
qplot(x = carat, y = price, data = diamonds,
xlim = c(0, quantile(diamonds$carat,0.99)),
ylim = c(0, quantile(diamonds$price, 0.99)))+
geom_point(fill = I('#F79420'), color = I('black'), shape = 21)
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 926 rows containing missing values (geom_point).
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(fill = I('#F79420'), color = I('black'), shape = 21) +
#linear trim
stat_smooth(method = 'lm') +
#using 99 percentile
xlim(0, quantile(diamonds$carat,0.99)) +
ylim(0, quantile(diamonds$price, 0.99))
## Warning: Removed 926 rows containing non-finite values (stat_smooth).
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).
Response: Non linear relationship, maybe exponential, maybe something else. Price increases with numbers of carats. If we paint a linear trend line, we can see that it doesn´t go through the center of the data in some key places. ***
Notes:
Notes:
Notes: Ggpairs provide: -lower triangle grouped histograms for qualitative-qualitative pairs (using the y as grouping factor) and scatter plots for quantitative-quantitative pairs -upper triangle grouped histograms for qualitative-qualitative pairs (using the x as grouping factor), boxplots for qualitative-quantitative pairs and correlation for quantitative-quantitative pairs.
# install these if necessary
#install.packages('GGally')
#install.packages('scales')
#install.packages('memisc')
#install.packages('lattice')
#install.packages('MASS')
#install.packages('car')
#install.packages('reshape')
#install.packages('plyr')
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 3.4.1
library(scales)
## Warning: package 'scales' was built under R version 3.4.1
library(memisc)
## Warning: package 'memisc' was built under R version 3.4.3
## Loading required package: lattice
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.4.3
##
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
##
## percent
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
# sample 10,000 diamonds from the data set otherwise it will take too long
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
#ggpairs(diamond_samp, params = c(shape = I('.'), outlier.shape = I('.')))
ggpairs(diamond_samp,
# to make the points smaller and clearer
lower = list(continuous = wrap("points", shape = I('.'))),
upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#axisLabels = 'internal')
What are some things you notice in the ggpairs output? Response:
Notes:
plot1 <- qplot(data = diamonds, x = price, binwidth = 100, fill = I('#099DD9')) +
ggtitle('price')
plot2 <- qplot(data = diamonds, x = price, binwidth = 0.01, fill = I('#F79420')) +
ggtitle('price (log10)') +
scale_x_log10()
plot3 <- qplot(data = diamonds, x = log10(price), binwidth = 0.01, fill = I('#F79420')) +
ggtitle('log10(price)')
library(gridExtra)
library(grid)
grid.arrange(plot1, plot2, plot3, ncol = 2)
Notes:
On the log scale prices look less dispersed at the high end of carat size and price.
qplot(carat, price, data = diamonds)+
scale_y_continuous(trans = log10_trans())+
ggtitle('Price (log10) by Carat')
We can do better (we are trying to get a linear representation)
# this is an R function. We need the inverse conversion too.
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing missing values (geom_point).
Due mainly to rounding we have multiple points overlapped which obscures some of the density and sparsity of our data, the key points.
head(sort(table(diamonds$carat), decreasing = T))
##
## 0.3 0.31 1.01 0.7 0.32 1
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing = T))
##
## 605 802 625 828 776 698
## 132 127 126 125 124 121
You can make your points smaller, jittering them and adding some transparency.
# Add a layer to adjust the features of the
# scatterplot. Set the transparency to one half,
# the size to three-fourths, and jitter the points.
# If you need hints, see the Instructor Notes.
# There are three hints so scroll down slowly if
# you don’t want all the hints at once.
ggplot(aes(carat, price), data = diamonds) +
geom_point(alpha = 0.5, size = 0.75, position = 'jitter') +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).
Notes:
Alter the code below.
# Adjust the code below to color the points by clarity.
# A layer called scale_color_brewer() has
# been added to adjust the legend and
# provide custom colors.
# See if you can figure out what it does.
# Links to resources are in the Instructor Notes.
# You will need to install the package RColorBrewer
# in R to get the same colors and color palettes.
# install and load the RColorBrewer package
#install.packages('RColorBrewer')
library(RColorBrewer)
ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1693 rows containing missing values (geom_point).
Response: For a fixed number of carats we can see that diamonds with lower clarity al almost always cheaper than diamonds with higher clarity.
Alter the code below.
ggplot(aes(x = carat, y = price, color = cut), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Cut', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Cut')
## Warning: Removed 1696 rows containing missing values (geom_point).
Response: Most of the data is from diamonds with ideal cut, but we do see that most of the fair cuts are in the lower prices for a given carat.
Alter the code below.
ggplot(aes(x = carat, y = price, color = color), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Color',
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1688 rows containing missing values (geom_point).
Response:Color does seem to explain some variance in the price.
Notes: In R we can create models using the lm() function. x lm(x~y) with x the outcome variable and y the explanatory variable. Response: In our case x = log(price) y = carat ^ 1/3 (cube root) ***
Notes: The I wrapper around the variables stands for as-is. In this case it tells R to use the expression inside the I function to tranform a variable before using it in the regression. This is instead of instructing R to interpret these symbols as part of the formula to construct the design matrix for the regression.
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
# updating the model
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
# mtable produces a table of estimates for several models.
mtable(m1, m2, m3, m4, m5, sdigits = 3)
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color,
## data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = diamonds)
##
## ============================================================================================
## m1 m2 m3 m4 m5
## --------------------------------------------------------------------------------------------
## (Intercept) 2.821*** 1.039*** 0.874*** 0.932*** 0.415***
## (0.006) (0.019) (0.019) (0.017) (0.010)
## I(carat^(1/3)) 5.558*** 8.568*** 8.703*** 8.438*** 9.144***
## (0.007) (0.032) (0.031) (0.028) (0.016)
## carat -1.137*** -1.163*** -0.992*** -1.093***
## (0.012) (0.011) (0.010) (0.006)
## cut: .L 0.224*** 0.224*** 0.120***
## (0.004) (0.004) (0.002)
## cut: .Q -0.062*** -0.062*** -0.031***
## (0.004) (0.003) (0.002)
## cut: .C 0.051*** 0.052*** 0.014***
## (0.003) (0.003) (0.002)
## cut: ^4 0.018*** 0.018*** -0.002
## (0.003) (0.002) (0.001)
## color: .L -0.373*** -0.441***
## (0.003) (0.002)
## color: .Q -0.129*** -0.093***
## (0.003) (0.002)
## color: .C 0.001 -0.013***
## (0.003) (0.002)
## color: ^4 0.029*** 0.012***
## (0.003) (0.002)
## color: ^5 -0.016*** -0.003*
## (0.003) (0.001)
## color: ^6 -0.023*** 0.001
## (0.002) (0.001)
## clarity: .L 0.907***
## (0.003)
## clarity: .Q -0.240***
## (0.003)
## clarity: .C 0.131***
## (0.003)
## clarity: ^4 -0.063***
## (0.002)
## clarity: ^5 0.026***
## (0.002)
## clarity: ^6 -0.002
## (0.002)
## clarity: ^7 0.032***
## (0.001)
## --------------------------------------------------------------------------------------------
## R-squared 0.924 0.935 0.939 0.951 0.984
## adj. R-squared 0.924 0.935 0.939 0.951 0.984
## sigma 0.280 0.259 0.250 0.224 0.129
## F 652012.063 387489.366 138654.523 87959.467 173791.084
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -7962.499 -3631.319 -1837.416 4235.240 34091.272
## Deviance 4242.831 3613.360 3380.837 2699.212 892.214
## AIC 15930.999 7270.637 3690.832 -8442.481 -68140.544
## BIC 15957.685 7306.220 3761.997 -8317.942 -67953.736
## N 53940 53940 53940 53940 53940
## ============================================================================================
Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.
Video Notes: OUR MODEL IS (see column 5, the last model in the previous result)
log(price) = 0.415 + 9.144 * carat^(1/3) - 1.093 * carat….
Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)
Response:
Notes:
#install.packages('bitops')
#install.packages('RCurl')
library('RCurl')
## Warning: package 'RCurl' was built under R version 3.4.3
## Loading required package: bitops
##library('bitops')
#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))
load("BigDiamonds.rda")
The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data
Notes:
# Your task is to build five linear models like Solomon
# did for the diamonds data set only this
# time you'll use a sample of diamonds from the
# diamondsbig data set.
# Be sure to make use of the same variables
# (logprice, carat, etc.) and model
# names (m1, m2, m3, m4, m5).
# To get the diamondsbig data into RStudio
# on your machine, copy, paste, and run the
# code in the Instructor Notes. There's
# 598,024 diamonds in this data set!
# Since the data set is so large,
# you are going to use a sample of the
# data set to compute the models. You can use
# the entire data set on your machine which
# will produce slightly different coefficients
# and statistics for the models.
# This exercise WILL BE automatically graded.
# You can leave off the code to load in the data.
# We've sampled the data for you.
# You also don't need code to create the table output of the models.
# We'll do that for you and check your model summaries (R^2 values, AIC, etc.)
# Your task is to write the code to create the models.
#
# DO NOT ALTER THE CODE BELOW THIS LINE (Reads in a sample of the diamondsbig data set)
#===========================================================================================
# diamondsBigSample <- read.csv('diamondsBigSample.csv')
# ENTER YOUR CODE BELOW THIS LINE. (Create the five models)
#===========================================================================================
diamondsbig$logprice = log(diamondsbig$price)
m1 <- lm(logprice ~ I(carat^(1/3)),
data=diamondsbig[diamondsbig$price < 10000 &
diamondsbig$cert == "GIA",])
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = logprice ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$price <
## 10000 & diamondsbig$cert == "GIA", ])
## m2: lm(formula = logprice ~ I(carat^(1/3)) + carat, data = diamondsbig[diamondsbig$price <
## 10000 & diamondsbig$cert == "GIA", ])
## m3: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut, data = diamondsbig[diamondsbig$price <
## 10000 & diamondsbig$cert == "GIA", ])
## m4: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color,
## data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert ==
## "GIA", ])
## m5: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert ==
## "GIA", ])
##
## ================================================================================================
## m1 m2 m3 m4 m5
## ------------------------------------------------------------------------------------------------
## (Intercept) 2.671*** 1.333*** 0.949*** 0.529*** -0.464***
## (0.003) (0.012) (0.012) (0.010) (0.009)
## I(carat^(1/3)) 5.839*** 8.243*** 8.633*** 8.110*** 8.320***
## (0.004) (0.022) (0.021) (0.017) (0.012)
## carat -1.061*** -1.223*** -0.782*** -0.763***
## (0.009) (0.009) (0.007) (0.005)
## cut: V.Good 0.120*** 0.090*** 0.071***
## (0.002) (0.001) (0.001)
## cut: Ideal 0.211*** 0.181*** 0.131***
## (0.002) (0.001) (0.001)
## color: K/L 0.123*** 0.117***
## (0.004) (0.003)
## color: J/L 0.312*** 0.318***
## (0.003) (0.002)
## color: I/L 0.451*** 0.469***
## (0.003) (0.002)
## color: H/L 0.569*** 0.602***
## (0.003) (0.002)
## color: G/L 0.633*** 0.665***
## (0.003) (0.002)
## color: F/L 0.687*** 0.723***
## (0.003) (0.002)
## color: E/L 0.729*** 0.756***
## (0.003) (0.002)
## color: D/L 0.812*** 0.827***
## (0.003) (0.002)
## clarity: I1 0.301***
## (0.006)
## clarity: SI2 0.607***
## (0.006)
## clarity: SI1 0.727***
## (0.006)
## clarity: VS2 0.836***
## (0.006)
## clarity: VS1 0.891***
## (0.006)
## clarity: VVS2 0.935***
## (0.006)
## clarity: VVS1 0.995***
## (0.006)
## clarity: IF 1.052***
## (0.006)
## ------------------------------------------------------------------------------------------------
## R-squared 0.888 0.892 0.899 0.937 0.969
## adj. R-squared 0.888 0.892 0.899 0.937 0.969
## sigma 0.289 0.284 0.275 0.216 0.154
## F 2700903.714 1406538.330 754405.425 423311.488 521161.443
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -60137.791 -53996.269 -43339.818 37830.414 154124.270
## Deviance 28298.689 27291.534 25628.285 15874.910 7992.720
## AIC 120281.582 108000.539 86691.636 -75632.827 -308204.540
## BIC 120313.783 108043.473 86756.037 -75482.557 -307968.400
## N 338946 338946 338946 338946 338946
## ================================================================================================
# DO NOT ALTER THE CODE BELOW THIS LINE (Tables your models and pulls out the statistics)
#===========================================================================================
suppressMessages(library(lattice))
suppressMessages(library(MASS))
suppressMessages(library(memisc))
models <- mtable(m1, m2, m3, m4, m5)
Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601
#Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
interval="prediction", level = .95)
# We need to exponentiate the result of the model wince we took the log of price
exp(modelEstimate)
## fit lwr upr
## 1 5040.436 3730.34 6810.638
Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.
Notes:
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!